The data science industry, at the intersection of statistics, computer science, and business analysis, has rapidly grown into a critical field for data-driven decision-making and innovation. It plays a pivotal role in various sectors, including technology, finance, healthcare, and retail, among others. The industry’s growth is fueled by the increasing generation of data and the need for sophisticated tools and methodologies to extract insights and inform business strategies. Data scientists are highly sought after for their ability to analyze complex datasets, create predictive models, and communicate findings effectively. The evolving nature of the industry, marked by advancements in machine learning, artificial intelligence, and big data technologies, continues to expand the scope and impact of data science roles, making it a dynamic and future-focused career field.
Understanding salary levels in the field of data science is valuable both from an individual’s career perspective and from an organizational standpoint. The research question of this analysis is: What factors contribute to the top quartile of data science salaries? The salary_in_usd variable is used as the main variable to ensure standardization and comparability of data across countries.
This report consists of several parts, including data evaluation, exploratory data analysis and advanced analysis.
The data used for this analysis was obtained from Kaggle, a platform for data scientists (link: https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023/data), in the form of a CSV file. Uploading data:
There are no missing values in the dataset.
## [1] 0
| Statistic | Value |
|---|---|
| Standard Deviation | 63055.63 |
| Variance | 3976011879.23 |
| Interquartile Range (IQR) | 80000.00 |
| Lower Bound for Outliers | -25000.00 |
| Upper Bound for Outliers | 295000.00 |
Interpretation of the values: The high variance indicates a substantial spread in the salary data, reflecting diverse salary ranges within the field. An IQR of 80,000 USD means that the central half of the data spans a wide salary range, emphasizing the diversity in compensation across different data science positions. The lower bound for outliers is theoretically -25,000 USD, but negative salaries are not feasible in practice. Since the lowest salary in the dataset (5,132 USD) lies above this threshold, there are no low outliers in the salary data. The upper bound for outliers shows that salaries above 295,000 USD are outliers, signifying extremely high-paying roles in the data science industry. These might be associated with highly specialized skills, leadership roles, or specific high-paying industries or regions. In conclusion, the data shows a broad range of salaries within the field of data science, indicating a diverse industry with varying levels of compensation. This range can be attributed to factors such as geographical location, level of education, experience, and specific job roles. The presence of high-paying outliers suggests opportunities for significantly lucrative roles in the industry. Understanding these salary dynamics is crucial both for professionals navigating their career paths and for organizations structuring their compensation strategies.
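The outlier bounds discussed above follow the usual 1.5 × IQR fence rule. The sketch below reproduces them from the first and third quartiles of salary_in_usd reported in this report's dataset summary (95,000 and 175,000 USD); the function name `iqr_outlier_bounds` is hypothetical, and the multiplier 1.5 is assumed to be the one used.

```python
def iqr_outlier_bounds(q1, q3, k=1.5):
    """Return (iqr, lower_bound, upper_bound) for the k * IQR fence rule."""
    iqr = q3 - q1
    return iqr, q1 - k * iqr, q3 + k * iqr


# Quartiles of salary_in_usd as reported in the dataset summary.
q1, q3 = 95_000, 175_000
iqr, lower, upper = iqr_outlier_bounds(q1, q3)
print(iqr, lower, upper)
```

With these inputs the bounds come out at -25,000 and 295,000 USD, matching the table.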
Below there is a boxplot illustrating the aforementioned statistics. The presence of outliers, indicated by points beyond the “whiskers” of the boxplot, suggests significant variation in the upper range of salaries. These outliers could represent highly specialized roles, exceptionally experienced individuals, or specific industries within data science where salaries are markedly higher.
The frequency tables below show several interesting trends:
1. The most frequent experience level is Senior.
2. Most employees work full time.
3. Medium-sized companies are prevalent.
4. The most frequent job titles are Data Engineer, Data Scientist and Data Analyst, followed by Machine Learning Engineer, Analytics Engineer and Data Architect. Many job titles occur only once or twice.
5. For employee residence there is an obvious trend: most records concern the United States of America.
6. Unsurprisingly, most companies are located in the United States of America.
Such huge differences in terms of frequency create difficulty in further analysis. For this reason, both job titles and employee residence values will be respectively categorized according to the career fields and regions.
| Value | Frequency |
|---|---|
| SE | 2516 |
| MI | 805 |
| EN | 320 |
| EX | 114 |
| Value | Frequency |
|---|---|
| FT | 3718 |
| PT | 17 |
| CT | 10 |
| FL | 10 |
| Value | Frequency |
|---|---|
| M | 3153 |
| L | 454 |
| S | 148 |
To assess the frequency of job titles, the table below groups the titles by how often they occur. Clearly, some job titles are prevalent but, interestingly, many are unique. This may be due to different phrasing, which would mean the role is the same but the job title is different. On the other hand, the uniqueness of some job titles may reflect highly specialized and innovative roles.
| Frequency | Value | Count |
|---|---|---|
| 1040 | Data Engineer | 1 |
| 840 | Data Scientist | 1 |
| 612 | Data Analyst | 1 |
| 289 | Machine Learning Engineer | 1 |
| 103 | Analytics Engineer | 1 |
| 101 | Data Architect | 1 |
| 82 | Research Scientist | 1 |
| 58 | Applied Scientist, Data Science Manager | 2 |
| 37 | Research Engineer | 1 |
| 34 | ML Engineer | 1 |
| 29 | Data Manager | 1 |
| 26 | Machine Learning Scientist | 1 |
| 24 | Data Science Consultant | 1 |
| 22 | Data Analytics Manager | 1 |
| 18 | Computer Vision Engineer | 1 |
| 16 | AI Scientist | 1 |
| 15 | BI Data Analyst, Business Data Analyst | 2 |
| 14 | Data Specialist | 1 |
| 13 | BI Developer | 1 |
| 12 | Applied Machine Learning Scientist | 1 |
| 11 | AI Developer, Big Data Engineer, Director of Data Science, Machine Learning Infrastructure Engineer | 4 |
| 10 | Applied Data Scientist, Data Operations Engineer, ETL Developer, Head of Data, Machine Learning Software Engineer | 5 |
| 9 | BI Analyst, Head of Data Science, Lead Data Scientist | 3 |
| 8 | Data Science Lead, Principal Data Scientist | 2 |
| 7 | Data Quality Analyst, Machine Learning Developer, NLP Engineer | 3 |
| 6 | Data Analytics Engineer, Data Infrastructure Engineer, Deep Learning Engineer, Lead Data Engineer, Machine Learning Researcher | 5 |
| 5 | Cloud Database Engineer, Computer Vision Software Engineer, Data Science Engineer, Lead Data Analyst, Product Data Analyst | 5 |
| 4 | 3D Computer Vision Researcher, Business Intelligence Engineer, Data Operations Analyst, Machine Learning Research Engineer, MLOps Engineer | 5 |
| 3 | Cloud Data Engineer, Financial Data Analyst, Lead Machine Learning Engineer, Machine Learning Manager | 4 |
| 2 | AI Programmer, Applied Machine Learning Engineer, Autonomous Vehicle Technician, Big Data Architect, Data Analytics Consultant, Data Analytics Lead, Data Analytics Specialist, Data Lead, Data Modeler, Data Scientist Lead, Data Strategist, ETL Engineer, Insight Analyst, Marketing Data Analyst, Principal Data Analyst, Principal Data Engineer, Software Data Engineer | 17 |
| 1 | Azure Data Engineer, BI Data Engineer, Cloud Data Architect, Compliance Data Analyst, Data DevOps Engineer, Data Management Specialist, Data Science Tech Lead, Deep Learning Researcher, Finance Data Analyst, Head of Machine Learning, Manager Data Management, Marketing Data Engineer, Power BI Developer, Principal Data Architect, Principal Machine Learning Engineer, Product Data Scientist, Staff Data Analyst, Staff Data Scientist | 18 |
| Frequency | Value | Count |
|---|---|---|
| 3004 | US | 1 |
| 167 | GB | 1 |
| 85 | CA | 1 |
| 80 | ES | 1 |
| 71 | IN | 1 |
| 48 | DE | 1 |
| 38 | FR | 1 |
| 18 | BR, PT | 2 |
| 16 | GR | 1 |
| 15 | NL | 1 |
| 11 | AU | 1 |
| 10 | MX | 1 |
| 8 | IT, PK | 2 |
| 7 | IE, JP, NG | 3 |
| 6 | AR, AT, PL | 3 |
| 5 | BE, PR, SG, TR | 4 |
| 4 | CH, CO, LV, RU, SI, UA | 6 |
| 3 | AE, BO, DK, HR, HU, RO, TH, VN | 8 |
| 2 | AS, CF, CL, CZ, FI, GH, HK, KE, LT, PH, SE, UZ | 12 |
| 1 | AM, BA, BG, CN, CR, CY, DO, DZ, EE, EG, HN, ID, IL, IQ, IR, JE, KW, LU, MA, MD, MK, MT, MY, NZ, RS, SK, TN | 27 |
| Frequency | Value | Count |
|---|---|---|
| 3224 | USD | 1 |
| 236 | EUR | 1 |
| 161 | GBP | 1 |
| 60 | INR | 1 |
| 25 | CAD | 1 |
| 9 | AUD | 1 |
| 6 | BRL, SGD | 2 |
| 5 | PLN | 1 |
| 4 | CHF | 1 |
| 3 | DKK, HUF, JPY, TRY | 4 |
| 2 | THB | 1 |
| 1 | CLP, CZK, HKD, ILS, MXN | 5 |
| Frequency | Value | Count |
|---|---|---|
| 3040 | US | 1 |
| 172 | GB | 1 |
| 87 | CA | 1 |
| 77 | ES | 1 |
| 58 | IN | 1 |
| 56 | DE | 1 |
| 34 | FR | 1 |
| 15 | BR | 1 |
| 14 | AU, GR, PT | 3 |
| 13 | NL | 1 |
| 10 | MX | 1 |
| 7 | IE | 1 |
| 6 | AT, JP, SG | 3 |
| 5 | CH, NG, PL, TR | 4 |
| 4 | BE, CO, DK, IT, LV, PK, PR, SI, UA | 9 |
| 3 | AE, AR, AS, CZ, FI, HR, LU, RU, TH | 9 |
| 2 | CF, EE, GH, HU, ID, IL, KE, LT, RO, SE | 10 |
| 1 | AL, AM, BA, BO, BS, CL, CN, CR, DZ, EG, HK, HN, IQ, IR, MA, MD, MK, MT, MY, NZ, PH, SK, VN | 23 |
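The frequency tables above group distinct values by how often they occur. A minimal sketch of that grouping is shown below; the function name `frequency_groups` and the toy input are illustrative assumptions and do not reproduce the real data.

```python
from collections import Counter, defaultdict


def frequency_groups(values):
    """Group distinct values by their frequency.

    Returns rows of (frequency, sorted values with that frequency,
    number of such values), ordered from most to least frequent.
    """
    counts = Counter(values)
    by_freq = defaultdict(list)
    for value, freq in counts.items():
        by_freq[freq].append(value)
    return [
        (freq, sorted(vals), len(vals))
        for freq, vals in sorted(by_freq.items(), reverse=True)
    ]


# Toy input only, not the actual employee_residence column.
sample = ["US", "US", "US", "GB", "GB", "CA", "CA", "ES"]
for freq, vals, n in frequency_groups(sample):
    print(freq, ", ".join(vals), n)
```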
The variables in the dsalaries dataset are listed below.
| Variable | Description |
|---|---|
| work_year | Year of the work data |
| experience_level | Level of experience |
| employment_type | Type of employment |
| job_title | Title of the job |
| salary | Salary in local currency |
| salary_currency | Currency of the salary |
| salary_in_usd | Salary in USD |
| employee_residence | Country of residence of the employee |
| remote_ratio | Percentage of work done remotely |
| company_location | Location of the company |
| company_size | Size of the company |
For the sake of insightful analysis, the employee residence variable is categorized by region. Because Europe and North America account for most records and the remaining regions have very few occurrences, the latter were combined into a single category, “Other Regions”. As the table below shows, this category includes Asia, South America, Africa and Oceania.
| Region | Employee Residence |
|---|---|
| Europe | AL, AD, AT, BA, BE, BG, BY, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, JE, LT, LU, LV, MC, MD, ME, MK, MT, NL, NO, PL, PT, RO, RS, RU, SE, SI, SK, TR, UA |
| Other Regions | AE, AM, CN, HK, ID, IL, IN, IQ, IR, JP, KW, MY, PH, PK, SG, TH, UZ, VN, AR, BO, BR, CL, CO, PE, UY, VE, CF, DZ, EG, GH, KE, MA, NG, TN, AS, AU, NZ |
| North America | CA, CR, DO, HN, MX, PR, US |
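The region assignment in the table above amounts to a lookup from country code to region. The sketch below uses the Europe and North America codes exactly as listed; treating every other code as “Other Regions” is a simplifying assumption, and the function name `region_of` is hypothetical.

```python
# Country codes copied from the region table above.
EUROPE = set(
    "AL AD AT BA BE BG BY CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT JE "
    "LT LU LV MC MD ME MK MT NL NO PL PT RO RS RU SE SI SK TR UA".split()
)
NORTH_AMERICA = set("CA CR DO HN MX PR US".split())


def region_of(country_code):
    """Map an ISO country code to one of the three regions.

    Assumption: any code not listed under Europe or North America
    falls into "Other Regions".
    """
    if country_code in EUROPE:
        return "Europe"
    if country_code in NORTH_AMERICA:
        return "North America"
    return "Other Regions"


print([region_of(code) for code in ["US", "GB", "IN", "BR"]])
```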
Because of the high number of different job titles, and for ease of further analysis, the job titles were grouped into the categories below. The grouping took into account criteria such as relevance to core skills and responsibilities, industry-standard role definitions, overlap with related fields, hierarchical and management aspects, and specialization or unique focus. The category “Emerging Technologies & Specialized Roles” is important: it contains many job titles, each of which occurs infrequently. Because of their degree of specialization, it is crucial to include them in the analysis, especially to assess whether niche roles can command high salaries.
| Category | Job Titles |
|---|---|
| Data Engineering & Architecture | Data Engineer, Data Architect, Big Data Engineer, Data Infrastructure Engineer, Data Operations Engineer, AI Developer, Director of Data Science, Cloud Database Engineer, Lead Data Engineer, Cloud Data Engineer, Principal Data Engineer, Software Data Engineer |
| Data Science & Analytics | Data Scientist, Data Analyst, Applied Scientist, Applied Data Scientist, Data Science Manager, Data Science Engineer, Data Manager, Data Science Consultant, Data Analytics Manager, BI Data Analyst, Business Data Analyst, Data Specialist, BI Developer, BI Analyst, Head of Data Science, Head of Data, Lead Data Scientist, Data Science Lead, Principal Data Scientist, Data Quality Analyst, NLP Engineer, Lead Data Analyst, Lead Data Scientist, Product Data Analyst, Data Operations Analyst, Cloud Data Engineer, Financial Data Analyst, Lead Machine Learning Engineer, Machine Learning Manager, Data Analytics Consultant, Data Analytics Lead, Data Analytics Specialist, Data Analytics Engineer, Data Lead, Data Modeler, Data Scientist Lead, Data Strategist, ETL Engineer, ETL Developer, Insight Analyst, Marketing Data Analyst, Principal Data Analyst |
| Machine Learning & Advanced Research | Machine Learning Developer, Applied Machine Learning Engineer, Machine Learning Engineer, Analytics Engineer, Research Scientist, Research Engineer, ML Engineer, Machine Learning Scientist, Machine Learning Software Engineer, Machine Learning Research Engineer, Applied Machine Learning Scientist, Big Data Engineer, Director of Data Science, Machine Learning Infrastructure Engineer, Machine Learning Researcher |
| Emerging Technologies & Specialized Roles | AI Developer, AI Scientist, AI Programmer, Applied Scientist, Data Science Manager, Deep Learning Engineer, Machine Learning Researcher, 3D Computer Vision Researcher, Business Intelligence Engineer, Azure Data Engineer, BI Data Engineer, Cloud Data Architect, Compliance Data Analyst, Data DevOps Engineer, Data Management Specialist, Data Science Tech Lead, Deep Learning Researcher, Finance Data Analyst, Head of Machine Learning, Manager Data Management, Marketing Data Engineer, Power BI Developer, Principal Data Architect, Principal Machine Learning Engineer, Product Data Scientist, Staff Data Analyst, Staff Data Scientist, Autonomous Vehicle Technician, Big Data Architect, Data Lead, Data Modeler, Data Strategist, ETL Engineer, Insight Analyst, Marketing Data Analyst, Principal Data Analyst, Computer Vision Engineer, Computer Vision Software Engineer, MLOps Engineer |
Below there is a barplot illustrating the frequency of job title categories. Creation of category ‘Emerging Technologies & Specialized Roles’ aims to examine the possibility of high salaries among the most unique, innovative and specialized roles in data science field.
The summary of the dsalaries dataset reveals some important observations, for instance:
- a wide range in salary figures;
- diversity in remote work arrangements;
- high maximum salaries.

Some of these observations will be further explored in the next parts of this report.
| Variable | Min. / Length | 1st Qu. / Class | Median / Mode | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| work_year | Min. :2020 | 1st Qu.:2022 | Median :2022 | Mean :2022 | 3rd Qu.:2023 | Max. :2023 |
| experience_level | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| employment_type | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| job_title | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| salary | Min. : 6000 | 1st Qu.: 100000 | Median : 138000 | Mean : 190696 | 3rd Qu.: 180000 | Max. :30400000 |
| salary_currency | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| salary_in_usd | Min. : 5132 | 1st Qu.: 95000 | Median :135000 | Mean :137570 | 3rd Qu.:175000 | Max. :450000 |
| employee_residence | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| remote_ratio | Min. : 0.00 | 1st Qu.: 0.00 | Median : 0.00 | Mean : 46.27 | 3rd Qu.:100.00 | Max. :100.00 |
| company_location | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| company_size | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| region | Length:3755 | Class :character | Mode :character | NA | NA | NA |
| job_title_category | Length:3755 | Class :character | Mode :character | NA | NA | NA |
This histogram, displaying the density and distribution of salaries in USD, clearly shows that most salaries fall roughly between 100,000 USD and 200,000 USD. The peak of the density plot aligns with the salary range where the highest number of data points are found.
The summary of the dsalaries dataset reveals that the lowest salary is 5,132 USD and the highest is 450,000 USD. The extent to which the density plot deviates from the center (median) can indicate skewness in the salary data. For instance, a long tail on the right of the density plot would suggest that a smaller number of individuals have salaries significantly higher than the median, which is consistent with the wide range observed.
The correlation heatmap shows that remote ratio, salary (expressed in the currency of the respective employee residence) and work year each have a seemingly negligible relationship with the salary in USD variable.
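A correlation heatmap of this kind is typically built from pairwise Pearson coefficients. The sketch below computes such a matrix on toy columns; the helper names `pearson` and `correlation_matrix` and the toy values are assumptions for illustration, and the report may have used a different correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return a nested dict of pairwise Pearson correlations."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy columns only, chosen so that remote_ratio is uncorrelated with salary.
toy = {
    "work_year": [2020, 2021, 2022, 2023],
    "remote_ratio": [0, 100, 0, 100],
    "salary_in_usd": [100_000, 100_000, 200_000, 200_000],
}
matrix = correlation_matrix(toy)
print(round(matrix["remote_ratio"]["salary_in_usd"], 3))
```

A coefficient near zero, as for remote_ratio in this toy example, corresponds to the "negligible" cells of a heatmap.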
This interactive violin plot for salary in USD vs. experience level enables examination of all statistics for the different categories of experience level. Some general interpretations, here relating to company size, can be deduced:
- Larger Company Compensation: The presence of higher and more varying salaries in larger companies could reflect the broader scope of roles, availability of resources, and the ability to pay for highly specialized skills.
- Smaller Company Dynamics: Smaller companies showing a narrower range of salaries could be due to a number of factors, including less role differentiation, budget constraints, or a more unified salary structure.
- Overall Salary Trends: The fact that outliers are present across all company sizes indicates that exceptionally high salaries are not exclusive to any particular company size and could be influenced more by individual role, skill level, or negotiation.
Analysis of the Boxplot:
Variability in Salary: There’s a noticeable difference in the interquartile range (IQR) – the box part of the boxplot – among the three company sizes. Large companies have a wider IQR compared to medium and small companies, suggesting more variability in the salaries offered by larger companies.
Median Salary Comparison: The median salary – indicated by the line in the middle of each box – appears to be highest in large companies, followed by medium and then small companies. This trend is a common observation in the industry, as larger companies often have more resources to offer competitive salaries.
Outliers: There are numerous outliers for large and medium companies, represented by the individual dots outside the upper whiskers of the boxplot. This suggests that within these company sizes, there are positions that command exceptionally high salaries, possibly due to specialized skills, senior roles, or other factors.
Lower Salary Range: Small companies show a compact box with fewer outliers, which could indicate a more uniform salary structure with less deviation from the median salary.
This scatter plot visualizes the average salaries for data science roles based on the employee’s residence within each region: Europe, North America, and Other Regions. Here’s a succinct analysis:
- Europe: Shows a cluster of average salaries with a tight range, suggesting less variability in pay across different countries within the region.
- North America: Displays higher average salaries than Europe, with a spread indicating that some residences in North America have significantly higher average salaries than others.
- Other Regions: There’s a wide spread in average salaries, with some residences showing comparable averages to North America, potentially indicating the presence of high-paying countries outside the traditional economic centers.

The plot underscores the regional disparities in average data science salaries and suggests that residence within these regions can be a strong indicator of salary expectations.
Overall, the plot shows that more specialized and advanced job title groups tend to have a higher variation in salaries, with the potential for significantly higher pay. Here are some conclusions:
- Data Engineering & Architecture: Shows a moderate median salary with a relatively compact interquartile range (IQR), indicating consistency in salaries within this group.
- Data Science & Analytics: This group has a similar median salary to Data Engineering & Architecture but a slightly wider IQR, suggesting more variation in pay.
- Emerging Technologies & Specialized Roles: This category exhibits a wider IQR and higher median salary, which could reflect the high demand and compensation for specialized skills.
- Machine Learning & Advanced Research: Has the widest IQR, indicating a significant spread in salaries, with some very high outliers, reflecting the premium paid for advanced ML expertise and research roles.
The scatter plot displays the average salaries for various job titles within each job category in the data science field. Here’s a succinct analysis:
- Data Engineering & Architecture: This category shows a cluster of job titles with average salaries mostly in the lower to middle salary range, suggesting that while important, these roles may not command the highest salaries.
- Data Science & Analytics: There’s a broad distribution of average salaries, indicating variability in compensation which may reflect a range of specializations and responsibilities within this category.
- Emerging Technologies & Specialized Roles: The average salaries are dispersed across a wide range, with several job titles commanding higher average salaries, highlighting the value of niche skills in the market.
- Machine Learning & Advanced Research: This category shows a concentration of higher average salaries, underscoring the industry’s demand for advanced technical skills and research capabilities.
The plot illustrates regional disparities in the growth rate of data science salaries, highlighting North America as the leader in compensation for these roles. Here’s a brief analysis:
- North America shows a consistent upward trend, maintaining the highest average salary across the observed years, which underscores the region’s strong market for data science roles.
- Europe also displays an upward trend but starts and remains below North America’s average salary throughout the years, indicating a growing but less lucrative market compared to North America.
- Other Regions exhibit the lowest average salaries, with a slight increase over time. This group may encompass regions with emerging markets for data science and varying economic conditions.
The line graph visualizes the trends of average salaries over several years, categorized by job title groups within the data science industry:
- Data Engineering & Architecture: This category shows substantial growth in average salaries over the years, indicating an increasing valuation of skills in this area.
- Data Science & Analytics: Starting from a higher baseline in 2020, this group also sees a consistent rise in average salaries, which may reflect the ongoing demand for data science expertise.
- Emerging Technologies & Specialized Roles: There’s a sharp increase from 2020 to 2023, suggesting a rapid growth in compensation, possibly due to the scarcity of cutting-edge technical skills as they become more in demand.
- Machine Learning & Advanced Research: With a significant leap in average salaries, this group tops the chart by 2023, underscoring the premium placed on advanced research and machine learning skills.
The graph indicates that while all areas are experiencing salary growth, the most pronounced increases are in specialized fields, which aligns with the high demand for advanced skill sets in the evolving data science market.
Variables salary_in_usd, work_year, and remote_ratio were chosen for clustering. These variables are scaled to ensure that they contribute equally to the clustering process.
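Scaling before clustering is commonly done by z-score standardization, which is consistent with the positive and negative centroid values reported later in this report. The sketch below shows that transformation on toy values; it assumes the sample standard deviation is used, and the function name `standardize` is hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Toy salaries only; after scaling, the variable has mean 0 and unit spread,
# so it weighs equally with work_year and remote_ratio in distance calculations.
scaled = standardize([50_000, 100_000, 150_000])
print(scaled)
```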
Analysis of the Elbow Method Plot:
- Decreasing WSS: Initially, as k increases, there is a steep decline in WSS, indicating significant gains from increasing the number of clusters.
- Elbow Point: The “elbow” of the plot appears to be at k = 4, where the rate of decrease sharply diminishes. This inflection point suggests that adding more clusters beyond this number results in diminishing returns in terms of WSS reduction.
- Optimal Clusters: Based on this plot, k = 4 is identified as the optimal number of clusters for the data because it represents a balance between minimizing WSS and avoiding overfitting with too many clusters.
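The elbow method can be illustrated by running k-means for several values of k and recording the within-cluster sum of squares (WSS). The sketch below is a plain Lloyd's algorithm on toy two-dimensional points. Two choices are assumptions made for reproducibility and are not taken from the report: the deterministic farthest-point initialization and the three well-separated toy groups (so the elbow here is at k = 3, not 4). All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def wss(points, centroids, labels):
    """Within-cluster sum of squares."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
wss_by_k = {k: wss(points, *kmeans(points, k)) for k in range(1, 6)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

In the printed values, WSS drops steeply up to the true number of groups and only marginally afterwards, which is the pattern the elbow plot is read for.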
K-means clustering algorithm was run with k set to 4. Clusters have been defined based on similarities in salary, work year, and remote work ratio.
| Cluster | Count |
|---|---|
| 1 | 1298 |
| 2 | 631 |
| 3 | 1207 |
| 4 | 619 |
| Cluster | Salary..Scaled. | Work.Year..Scaled. | Remote.Ratio..Scaled. |
|---|---|---|---|
| 1 | -0.3885587 | 0.2986313 | -0.9301078 |
| 2 | 1.2550413 | 0.4566456 | -0.8919661 |
| 3 | 0.2802829 | 0.1498031 | 1.0963928 |
| 4 | -1.0111200 | -1.3838113 | 0.7217518 |
The values above indicate the following. Note that work_year records the year of the observation, not an employee’s experience or tenure, so Work Year (Scaled) reflects how recent the records are.
Cluster 1: represents below-average salaries, slightly more recent records, and infrequent remote work. The negative values in Salary (Scaled) and Remote Ratio (Scaled) indicate lower salaries and infrequent remote work, while the positive value in Work Year (Scaled) indicates records from somewhat later years than average.
Cluster 2: represents highly paid employees with a low remote work ratio. The high positive value in Salary (Scaled) indicates salaries well above average, and the positive value in Work Year (Scaled) indicates more recent records. The negative value in Remote Ratio (Scaled) suggests a low tendency for remote work, possibly indicating in-office high-level professionals.
Cluster 3: represents slightly above-average salaries, records from close to the average year, and frequent remote work. The small positive value in Salary (Scaled) places salaries just above the mean, and the positive value in Remote Ratio (Scaled) indicates a high tendency for remote work, possibly indicating remote workers or freelancers.
Cluster 4: represents the lowest-paid group, drawn from the earliest records, with a higher tendency for remote work. The highly negative values in both Salary (Scaled) and Work Year (Scaled) indicate lower salaries and older observations. The positive value in Remote Ratio (Scaled) suggests a higher tendency for remote work.
The scatter plot visualizes the first two principal components obtained from a PCA (Principal Component Analysis). The points are colored according to the four clusters identified by the k-means algorithm.
- Distinct Clusters: The PCA plot shows that the four clusters are distinct, as they are spread out across the first two principal components, which are the dimensions capturing the most variance.
- Cluster Overlap: There appears to be some overlap between the clusters, particularly between clusters 1 and 2. This suggests some similarity between these groups in the multidimensional space of the original variables.
- PCA Effectiveness: The clear separation of clusters along PC1 and PC2 indicates that PCA is effective in reducing dimensionality while still preserving the structure necessary for cluster differentiation.
| cluster | work_year | salary | salary_in_usd | remote_ratio |
|---|---|---|---|---|
| 1 | 2022.580 | 125331.6 | 113069.58 | 1.078582 |
| 2 | 2022.689 | 218123.8 | 216707.80 | 2.931854 |
| 3 | 2022.477 | 161147.3 | 155243.80 | 99.544325 |
| 4 | 2021.417 | 357416.1 | 73813.58 | 81.340872 |
Work_Year: Cluster 1: The average work year is approximately 2022.58, suggesting most data points are from around mid-2022. Cluster 2: Slightly later, with an average work year around late 2022 (2022.689). Cluster 3: Similar to Cluster 1, with an average year around early to mid-2022 (2022.477). Cluster 4: The average year falls around early 2021 (2021.417), indicating this cluster contains older data.
Salary: Clusters 1, 2, and 3 have average salaries of approximately 125,332, 218,124, and 161,147 respectively. These figures represent the average salary without considering currency differences. Cluster 4 has a significantly higher average salary of around 357,416.
Salary_in_USD: This column normalizes salaries across clusters to US dollars, facilitating direct comparison. Clusters 1, 2, and 3 have average salaries in USD of around 113,070, 216,708, and 155,244 respectively. Cluster 4, despite having the highest average nominal salary, has a lower average when converted to USD (73,814), suggesting this cluster might contain data from countries with higher nominal salaries but lower value in USD.
Remote_Ratio: For Cluster 1, the ratio is around 1.08, suggesting very low remote work prevalence. Cluster 2 has a slightly higher ratio of around 2.93, indicating a marginal increase in remote work. Cluster 3 shows a significant jump, with a ratio of 99.54, suggesting almost entirely remote work. Cluster 4 also indicates high remote work prevalence (81.34), although not as high as Cluster 3.
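A per-cluster profile like the one interpreted above is a group-wise mean of the original (unscaled) variables. The sketch below shows the computation on toy records; the function name `cluster_profile`, the row layout and the values are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def cluster_profile(rows, labels, columns):
    """Average the given columns within each cluster label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {col: mean(r[col] for r in members) for col in columns}
        for label, members in sorted(grouped.items())
    }


# Toy records and labels only, not the clusters from this report.
rows = [
    {"salary_in_usd": 100_000, "remote_ratio": 0},
    {"salary_in_usd": 120_000, "remote_ratio": 0},
    {"salary_in_usd": 150_000, "remote_ratio": 100},
]
labels = [1, 1, 3]
profile = cluster_profile(rows, labels, ["salary_in_usd", "remote_ratio"])
print(profile)
```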
Based on this plot, one might choose k=4 for clustering as it appears to be the point after which the reductions in WSS become less significant, indicating that additional clusters do not contribute much to explaining the variance.
Overall, the clustering suggests that salaries in the data science field are influenced by job title and geographic location, with significant variance between different clusters. The larger clusters likely represent more common salary ranges and roles, while the smaller clusters may reflect specialized or regional characteristics of the data science job market.
Cluster 4: This is the largest cluster with 3006 individuals, indicating it may represent the most common salary range and job characteristics within the dataset. The average salary is relatively high at approximately $153,005, with a median close to $145,000, suggesting a strong central concentration around this salary level. The wide salary range indicates significant diversity within this group. The most common job title is “Data Engineer,” and the most common residence is the United States, which could imply that data engineering is a lucrative and common role in the US data science job market.
Cluster 3: Comprising 733 individuals, this cluster has a lower average salary of around $76,055 and a median of $65,000, which might reflect early to mid-career positions. The salary range is also wide, potentially indicating a variety of job roles within this cluster. The prevalent job title is “Data Scientist,” and the top residence is Great Britain, suggesting that data science roles in GB are diverse and possibly include a range from junior to senior positions.
Cluster 1: This is a very small cluster with only 14 individuals, which could represent a niche or specialized segment within the data science market. The average salary is around $60,800, with a narrower salary range. The primary job title is “Cloud Data Engineer,” and the top residence is Argentina, indicating a specific market or demand for cloud engineering expertise in that region.
Cluster 2: The smallest cluster, with just 2 individuals, has an average and median salary of $22,500, which is substantially lower than the other clusters. This may indicate entry-level positions or roles in regions with lower salary scales. The job title “Compliance Data Analyst” and the residence in Nigeria suggest these might be specialized roles in a particular sector or locale.
| cluster | Count | Average_Salary | Median_Salary | Salary_Range | Top_Job_Title | Top_Residence |
|---|---|---|---|---|---|---|
| 4 | 3006 | 153004.70 | 145000 | 426000 | Data Engineer | US |
| 3 | 733 | 76055.23 | 65000 | 294868 | Data Scientist | GB |
| 1 | 14 | 60800.57 | 55000 | 148000 | Cloud Data Engineer | AR |
| 2 | 2 | 22500.00 | 22500 | 15000 | Compliance Data Analyst | NG |
Here’s an interpretation of the PCA plot:
Variance Explained: Both PC1 and PC2 explain a very small amount of the variance (1% each). This suggests that these two components do not capture the majority of the information in the dataset. The low variance explained by the first two principal components may indicate that the dataset is high-dimensional or that the variability is spread out over many variables.
Cluster Distribution: Despite the low variance explained, the clusters appear to be differentiated along the PC1 axis, though there is considerable overlap along the PC2 axis. This could mean that the feature or combination of features that most strongly define the clusters are captured by PC1.
Cluster Overlap: The significant overlap of clusters, especially along PC2, suggests that the clusters are not entirely distinct in the first two principal component dimensions. This might imply that the clusters are not well-separated in the higher-dimensional space or that more components are needed to achieve clear separation.
The silhouette plot reveals an unexpected insight into the clustering structure of the data science salaries dataset. The elbow method suggested four clusters as optimal for our dataset. In contrast, the silhouette analysis showed mixed results: one cluster exhibits high silhouette scores, indicating strong internal agreement, while another presents significant negative values, suggesting a poor fit within the cluster. This discrepancy implies that:
1. The k-means assumption of spherical clusters may not hold for this data.
2. The actual structure of the data might be more complex than k-means can capture.
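For reference, the silhouette value of a point compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes it with Euclidean distance on toy points; the function name `silhouette_values`, the toy data and the convention of assigning 0 to singleton clusters are assumptions for illustration.

```python
from math import dist


def silhouette_values(points, labels):
    """Silhouette value per point; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the sum includes the zero distance from p to itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        values.append((b - a) / max(a, b))
    return values


# Toy example: the last point sits closer to cluster 1 but is labeled 0,
# so its silhouette value is negative (a poor fit within its cluster).
points = [(0, 0), (0, 1), (10, 0), (10, 1), (6, 0)]
labels = [0, 0, 1, 1, 0]
sil = silhouette_values(points, labels)
print([round(s, 2) for s in sil])
```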
Below are decision trees for each region: Europe, North America and Other Regions. They display decisions based on experience level and job title category. Salaries are expressed in units of 10,000 USD.
Below there is a plot displaying the meaning of colors - which quartile and which level of salary in USD they represent.
Below there is a table with explanation of abbreviations used in decision trees.
| Job Title Category | Abbreviation |
|---|---|
| Data Engineering & Architecture | DEA |
| Data Science & Analytics | DSA |
| Emerging Technologies & Specialized Roles | ETSR |
| Machine Learning & Advanced Research | MLAR |
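Since the research question concerns the top quartile of salaries, it can help to make the quartile assignment explicit. The sketch below labels a salary with its quartile using the cut points of salary_in_usd from this report's dataset summary (95,000, 135,000 and 175,000 USD). The function name `salary_quartile` is hypothetical, and placing a salary that equals a cut point in the lower quartile is an assumption; the decision trees may have used a different convention.

```python
from bisect import bisect_left

# First quartile, median and third quartile of salary_in_usd from the dataset summary.
CUT_POINTS = [95_000, 135_000, 175_000]


def salary_quartile(salary_usd):
    """Return the quartile (1 to 4) of a salary.

    Assumption: a salary equal to a cut point belongs to the lower quartile.
    """
    return bisect_left(CUT_POINTS, salary_usd) + 1


for salary in (5_132, 135_000, 175_000, 200_000):
    print(salary, salary_quartile(salary))
```

Under this convention, the top quartile consists of salaries strictly above 175,000 USD.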
This report aimed to answer the research question: “What factors contribute to the top quartile of data science salaries?” Based on the analysis conducted, several key factors influencing salaries in the data science industry were identified.
Experience Level and Job Title Impact: Higher salaries are commonly associated with senior-level positions and specialized job titles such as Machine Learning & Advanced Research roles. The decision trees and clustering analysis highlighted that experience and job title categories significantly influence salary levels.
Geographical Variations: The salary distribution varied significantly across regions. North America generally offered higher salaries compared to Europe and other regions. This trend was evident in both the boxplots and the average salary trends over the years.
Company Size: Larger companies tended to offer higher and more varying salaries. This was visible in the boxplot analysis where larger companies had a wider interquartile range, indicating diverse compensation strategies.
Emerging Technologies and Specializations: Job titles within the “Emerging Technologies & Specialized Roles” category often correlated with higher salaries. This suggests that niche skills and innovative roles are highly valued in the industry.
Remote Work Flexibility: The analysis showed varying impacts of remote work on salaries. While some high-paying roles offered flexibility, there was no consistent trend indicating a direct correlation between remote work and higher salaries.
In conclusion, while the data science field offers diverse and lucrative career opportunities, factors such as experience level, job title, geographic location, and company size play crucial roles in determining salary levels. Continuous learning and adaptation to industry trends are key for professionals aiming to reach the top quartile of data science salaries.